
Update cythonization in CI #4764

Merged: 11 commits merged into dask:main on Apr 29, 2021
Conversation

@jrbourbeau (Member)

This is just a sanity check to see which CI test builds are using the cythonized scheduler. cc @jakirkham @crusaderky

xref #4760 (comment)

@jrbourbeau (Member Author)

Note that I expect this to fail for all builds except Python 3.7 as that's the one build in which we are cythonizing things

- name: Cythonize
shell: bash -l {0}
if: ${{ matrix.python-version == '3.7' }}
run: python setup.py build_ext --with-cython

@jrbourbeau (Member Author)

Alright, as @jakirkham mentioned in #4760 (comment), our current CI setup isn't testing the cythonized scheduler correctly. I've updated our CI installation step and added a small test to ensure that we're cythonizing the scheduler as expected.
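
For reference, a minimal sketch of what such a check could look like (the actual test_cythonized added in this PR may differ); as noted above, it is expected to fail on builds that skip the Cythonize step:

# hypothetical sketch of a cythonization check, not necessarily the exact test from this PR
import distributed.scheduler

def test_cythonized():
    # if the compiled extension was imported, __file__ points at a .so/.pyd
    # rather than at the pure-Python scheduler.py
    assert not distributed.scheduler.__file__.endswith(".py")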

@jakirkham (Member)

Yeah, that doesn't surprise me. Had tried to get this to work back when GitHub Actions was only used for Windows (#4326), but ran into difficulties. Not having access to Windows locally, plus the fact that we primarily Cythonize the Scheduler on Linux, made it hard to prioritize further work to fix CI. Agree it would be good to get this working, and it may make more sense to do now that we are making fewer changes to the Scheduler.

@jrbourbeau (Member Author)

Sounds good, thanks for taking a look at this @jakirkham. I'm waiting for test_cythonized to pass on all CI builds and then I'll run the full test suite to see what happens

@jakirkham (Member)

I think one of the issues we are likely running into here is that we have both a .py and a .so file with the same name, so how Python decides which one to load determines whether the Cythonized Scheduler is used. On Linux and macOS, the interpreter typically seems to load the .so file preferentially; on Windows it seems to be the other way around. Ideally we wouldn't leave it up to the interpreter to choose between the two. Part of what that PR was trying to do was ensure that only one of these files gets installed.
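
A quick way to see which of the two files the interpreter actually picked up (a diagnostic sketch, not something from this PR):

import distributed.scheduler

# on Linux/macOS this typically prints a path ending in .so (the compiled extension);
# on Windows it may point at the pure-Python scheduler.py instead
print(distributed.scheduler.__file__)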

@jrbourbeau (Member Author)

Ah, sorry I missed that you linked to #4326. Would you prefer for us to move back to that PR instead of what I have here?

@jakirkham (Member)

Not at all. That was more for context. The linked PR is pretty old and likely needs to be refreshed. Happy to just start here with what you have 🙂

jrbourbeau changed the title from "[WIP] Check when scheduler is cythonized in CI" to "[WIP] Update cythonization in CI" on Apr 28, 2021
@jakirkham (Member) commented Apr 28, 2021

Woot! 🎉 Looks like we are now reproducing issue ( #4760 ) on CI

Got error on 'http://localhost:8787/status/': HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:8787/status/
...
distributed/cli/tests/test_dask_scheduler.py::test_dashboard

@jrbourbeau (Member Author)

Okay, so it looks like the distributed/cli/tests/test_dask_scheduler.py::test_dashboard failures on the cythonized builds are related to #4760. I can confirm that locally, when I check out #4761 and use the cythonized scheduler, test_dashboard passes.

@jakirkham (Member)

So how should we handle incorporating these PRs? Does GitHub Actions have a concept of allowed failures? Or are we ok with having a failure that we clear out with a later PR?

@jrbourbeau (Member Author)

Hmm good question. I'd like to merge #4761 and then merge main here to see if there are other failures. If there are, we can open up separate PRs to fix them, merge them into main, and then check back here until everything is green. What do you think about that?

@jakirkham (Member)

SGTM. Happy to dig into any additional failures we find.

FWIW, I have usually been running the Scheduler tests locally, and sometimes other components (like the Client and Worker tests), but not always the full test suite. So it's possible we'll encounter some failures that are a bit more off the beaten path.

@jakirkham (Member)

Had some lingering changes to add a global variable to the Scheduler file indicating whether it was compiled. Cleaned them up and submitted them as a PR (jrbourbeau#2) based off this one. This should make it a bit easier to check whether the Scheduler was Cythonized, and it can also be used to mark tests as xfailed when the Scheduler is Cythonized if needed. Hopefully that will let us get this change in sooner, without needing to wait for fixes for individual test failures to go in beforehand.

* Add `COMPILED` global variable to scheduler

This should make it easy to tell whether the scheduler was compiled with
Cython or not.

* Simplify Cythonized Scheduler check
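
Roughly, the flag works like this (a sketch, assuming a pure-Python fallback is wanted when Cython isn't installed; the actual code in jrbourbeau#2 may differ):

# sketch of a module-level flag in distributed/scheduler.py
try:
    import cython

    # cython.compiled is True only when this module runs as a Cython-built extension
    COMPILED = cython.compiled
except ImportError:
    COMPILED = False
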
@jrbourbeau (Member Author)

It looks like some tests are failing because distributed.is_coroutine_function isn't recognizing some Scheduler methods, e.g. Scheduler.restart, as coroutine functions when the scheduler is cythonized.

With cythonization:

In [1]: from distributed import Scheduler

In [2]: Scheduler.restart
Out[2]: <cyfunction Scheduler.restart at 0x120db4d40>

In [3]: from distributed.utils import is_coroutine_function

In [4]: is_coroutine_function(Scheduler.restart)
Out[4]: False

Without cythonization:

In [1]: from distributed import Scheduler

In [2]: Scheduler.restart
Out[2]: <function distributed.scheduler.Scheduler.restart(self, client=None, timeout=3)>

In [3]: from distributed.utils import is_coroutine_function

In [4]: is_coroutine_function(Scheduler.restart)
Out[4]: True

Is there a way we can have Cython preserve coroutine functions?

@jrbourbeau (Member Author)

Another class of failures is due to a cloudpickle serialization issue:

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/protocol/serialize.py", line 329, in serialize
    header, frames = dumps(x, context=context) if wants_context else dumps(x)
  File "/home/runner/work/distributed/distributed/distributed/protocol/serialize.py", line 55, in pickle_dumps
    protocol=context.get("pickle-protocol", None) if context else None,
  File "/home/runner/work/distributed/distributed/distributed/protocol/pickle.py", line 60, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.7/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
_pickle.PicklingError: Can't pickle <cyfunction Scheduler.__init__.<locals>.<lambda> at 0x7f7cd474d390>: attribute lookup lambda5 on distributed.scheduler failed

@jakirkham (Member)

Thanks James! 😄 WDYT about marking all of these xfail and merging this PR?

I think the issue in these cases is that Cython constructs functions for some of the more atypical cases (closures, lambdas, coroutines, etc.) in a way that works in C but comes with some tradeoffs (like not being able to pickle them). Some of these might be solved with a small rewrite or similar; some may be bugs in Cython. Will take a look and see what we can do about them.
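
As an illustration of the kind of small rewrite that can sidestep the pickling problem, a lambda or closure created inside __init__ can be replaced by a module-level function bound with functools.partial, which pickles by reference. The names below are purely hypothetical, not taken from the Scheduler:

import functools
import pickle

def _connect(addr, retries=3):
    # module-level functions serialize by qualified name, even when the module is compiled
    return f"connect {addr} with {retries} retries"

class Widget:
    def __init__(self, retries=3):
        # closure version, which becomes an unpicklable <cyfunction ... <lambda>> once Cythonized:
        #   self.callback = lambda addr: _connect(addr, retries=retries)
        # partial version, which stays picklable:
        self.callback = functools.partial(_connect, retries=retries)

# round-trips through pickle without error
assert pickle.loads(pickle.dumps(Widget().callback))("tcp://127.0.0.1:8786").startswith("connect")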

@jrbourbeau (Member Author)

WDYT about marking all of these xfail and merging this PR?

+1 to marking the tests right now
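
With the new COMPILED flag, marking a known failure might look something like this (a sketch; the test name and reason string are hypothetical):

import pytest
from distributed.scheduler import COMPILED

@pytest.mark.xfail(COMPILED, reason="known failure with the Cythonized scheduler")
def test_example():  # hypothetical test name
    ...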

jrbourbeau changed the title from "[WIP] Update cythonization in CI" to "Update cythonization in CI" on Apr 28, 2021
jakirkham merged commit 56aed44 into dask:main on Apr 29, 2021
@jakirkham (Member)

Thanks James! 😄

jrbourbeau deleted the check-cythonized branch on April 29, 2021 02:10
@jakirkham (Member)

Opened PR ( #4767 ) to start working on fixing the skipped/xfailed tests

@jakirkham (Member)

Fixing a couple of pickling issues with PR ( #4768 )

@jakirkham (Member)

It looks like some tests are failing because distributed.is_coroutine_function isn't recognizing some Scheduler methods, e.g. Scheduler.restart, as coroutine functions when the scheduler is cythonized. [...] Is there a way we can have Cython preserve coroutine functions?

Not sure at the moment, but I did find an MRE and filed it as an issue (cython/cython#4138).

@da-woods

I've posted this on the Cython bug, but to summarize: asyncio.iscoroutinefunction is more flexible than inspect.iscoroutinefunction, and recent versions of Cython will pass that test.

return inspect.iscoroutinefunction(f) or gen.is_coroutine_function(f)

I think the consensus for Cython is that the inspect test is too restrictive, and Cython can't pass that.

@jakirkham (Member)

Yeah that makes sense. Thanks for the advice 🙂

Was looking through the inspect module, but wasn't seeing a good way to extend it.

Think we could use asyncio here. We already have a utility function for this kind of check, so we could make that change pretty easily.
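
A sketch of what switching that utility over to the asyncio check could look like (the actual change went in via #4771 and may differ in details):

import asyncio
from tornado import gen

def is_coroutine_function(f):
    # asyncio.iscoroutinefunction recognizes coroutine functions produced by
    # recent Cython versions, unlike inspect.iscoroutinefunction
    return asyncio.iscoroutinefunction(f) or gen.is_coroutine_function(f)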

jakirkham mentioned this pull request on Apr 29, 2021
@jakirkham (Member)

Submitted PR ( #4771 ) to switch to the asyncio implementation. Thanks again for the suggestion 🙂
